The dataset for this project contains delayed flights information for US airlines from the year 2003 to 2008.
The focus of this project is centered around using data to create insightful visualization from which findings can be made and communicated.
The primary variables of focus are the departure delays and arrival delays of the flights which would be used in conjuction with other variables like the days of the week, months of the year and the various years under investigation.
The result of the analysis after properly wrangling the data would be used to answer questions like:
Trends would be displayed on graphs/charts and conclusions drawn based on the results.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
import matplotlib.patches as mpatches
import datetime
from scipy import stats
from IPython.core.display import HTML
%matplotlib inline
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
The original dataset contains data from 1987 till 2008 and has a size of approximately 12gb which was too large for the machine used in this analysis, so focus was shifted to data beginning from the new millenium.
#Load the combined flight data from year 2003 to 2008 which was merged in another notebook
flight_df = pd.read_csv('df_2003_2008.csv')
#View the first set of rows at the top
flight_df.head()
| Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 1 | 29 | 3 | 1651.0 | 1655 | 1912.0 | 1913 | UA | 1017 | ... | 5.0 | 17.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 2003 | 1 | 30 | 4 | 1654.0 | 1655 | 1910.0 | 1913 | UA | 1017 | ... | 2.0 | 26.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2003 | 1 | 31 | 5 | 1724.0 | 1655 | 1936.0 | 1913 | UA | 1017 | ... | 5.0 | 17.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 2003 | 1 | 1 | 3 | 1033.0 | 1035 | 1625.0 | 1634 | UA | 1018 | ... | 6.0 | 11.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 2003 | 1 | 2 | 4 | 1053.0 | 1035 | 1726.0 | 1634 | UA | 1018 | ... | 13.0 | 46.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
#View the last set of rows at the end
flight_df.tail()
| Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37742755 | 2008 | 4 | 17 | 4 | 1025.0 | 1025 | 1234.0 | 1237 | DL | 1207 | ... | 5.0 | 16.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 37742756 | 2008 | 4 | 17 | 4 | 1319.0 | 1320 | 1527.0 | 1524 | DL | 1208 | ... | 9.0 | 12.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 37742757 | 2008 | 4 | 17 | 4 | 1335.0 | 1335 | 1556.0 | 1553 | DL | 1209 | ... | 7.0 | 31.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 37742758 | 2008 | 4 | 17 | 4 | 1933.0 | 1935 | 2140.0 | 2141 | DL | 1210 | ... | 9.0 | 12.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 37742759 | 2008 | 4 | 17 | 4 | 621.0 | 615 | 752.0 | 754 | DL | 1211 | ... | 15.0 | 12.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
#view some random samples of rows
flight_df.sample(20)
| Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6374093 | 2003 | 12 | 14 | 7 | 708.0 | 713 | 836.0 | 833 | AA | 1121 | ... | 8.0 | 23.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6062225 | 2003 | 12 | 11 | 4 | 1140.0 | 1140 | 1417.0 | 1425 | WN | 300 | ... | 4.0 | 10.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5451262 | 2003 | 11 | 11 | 2 | 1533.0 | 1535 | 1652.0 | 1655 | US | 1109 | ... | 4.0 | 12.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 33490268 | 2007 | 9 | 6 | 4 | 1709.0 | 1659 | 1936.0 | 1925 | DL | 1401 | ... | 14.0 | 9.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 10844559 | 2004 | 8 | 18 | 3 | 820.0 | 820 | 852.0 | 859 | OO | 3947 | ... | 3.0 | 10.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5963527 | 2003 | 12 | 10 | 3 | 2040.0 | 1930 | 2152.0 | 2050 | US | 631 | ... | 3.0 | 11.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 62.0 |
| 8128528 | 2004 | 3 | 31 | 3 | 1106.0 | 1115 | 1248.0 | 1258 | AA | 2006 | ... | 3.0 | 6.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8561888 | 2004 | 4 | 25 | 7 | 1750.0 | 1750 | 2013.0 | 2021 | EV | 4199 | ... | 8.0 | 20.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 13943267 | 2005 | 1 | 16 | 7 | 2055.0 | 2055 | 2220.0 | 2158 | EV | 4105 | ... | 4.0 | 43.0 | 0 | NaN | 0 | 0.0 | 0.0 | 22.0 | 0.0 | 0.0 |
| 8420483 | 2004 | 4 | 16 | 5 | 1500.0 | 1505 | 1621.0 | 1637 | OH | 5264 | ... | 5.0 | 10.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4773174 | 2003 | 9 | 15 | 1 | 940.0 | 942 | 1130.0 | 1128 | AA | 2245 | ... | 3.0 | 30.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 27583781 | 2006 | 12 | 30 | 6 | 731.0 | 730 | 833.0 | 844 | US | 1079 | ... | 2.0 | 9.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6438826 | 2003 | 12 | 9 | 2 | 944.0 | 930 | 1322.0 | 1249 | CO | 548 | ... | 4.0 | 43.0 | 0 | NaN | 0 | 14.0 | 0.0 | 19.0 | 0.0 | 0.0 |
| 21335233 | 2006 | 1 | 27 | 5 | 2046.0 | 2050 | 2208.0 | 2214 | DL | 440 | ... | 4.0 | 14.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 15739000 | 2005 | 4 | 28 | 4 | 1002.0 | 1010 | 1349.0 | 1320 | HA | 7 | ... | 3.0 | 31.0 | 0 | NaN | 0 | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 20870572 | 2006 | 1 | 29 | 7 | 1610.0 | 1610 | 1712.0 | 1720 | WN | 2504 | ... | 7.0 | 15.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 17501995 | 2005 | 7 | 26 | 2 | 841.0 | 845 | 1721.0 | 1728 | UA | 172 | ... | 7.0 | 28.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 28007779 | 2007 | 1 | 20 | 6 | 1735.0 | 1745 | 1812.0 | 1821 | XE | 2651 | ... | 5.0 | 13.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 36550744 | 2008 | 3 | 10 | 1 | 1204.0 | 1125 | 1205.0 | 1140 | WN | 713 | ... | 7.0 | 11.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 |
| 12273609 | 2004 | 10 | 31 | 7 | 858.0 | 900 | 1720.0 | 1726 | NW | 846 | ... | 8.0 | 25.0 | 0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
20 rows × 29 columns
#Check the data types
flight_df.dtypes
Year int64 Month int64 DayofMonth int64 DayOfWeek int64 DepTime float64 CRSDepTime int64 ArrTime float64 CRSArrTime int64 UniqueCarrier object FlightNum int64 TailNum object ActualElapsedTime float64 CRSElapsedTime float64 AirTime float64 ArrDelay float64 DepDelay float64 Origin object Dest object Distance int64 TaxiIn float64 TaxiOut float64 Cancelled int64 CancellationCode object Diverted int64 CarrierDelay float64 WeatherDelay float64 NASDelay float64 SecurityDelay float64 LateAircraftDelay float64 dtype: object
# Check additional info to determine the shape of the dataframe
flight_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 37742760 entries, 0 to 37742759 Data columns (total 29 columns): # Column Dtype --- ------ ----- 0 Year int64 1 Month int64 2 DayofMonth int64 3 DayOfWeek int64 4 DepTime float64 5 CRSDepTime int64 6 ArrTime float64 7 CRSArrTime int64 8 UniqueCarrier object 9 FlightNum int64 10 TailNum object 11 ActualElapsedTime float64 12 CRSElapsedTime float64 13 AirTime float64 14 ArrDelay float64 15 DepDelay float64 16 Origin object 17 Dest object 18 Distance int64 19 TaxiIn float64 20 TaxiOut float64 21 Cancelled int64 22 CancellationCode object 23 Diverted int64 24 CarrierDelay float64 25 WeatherDelay float64 26 NASDelay float64 27 SecurityDelay float64 28 LateAircraftDelay float64 dtypes: float64(14), int64(10), object(5) memory usage: 8.2+ GB
flight_df.shape
(37742760, 29)
#Change data types of the columns from int to str in preparation to create another column
flight_df.Year = flight_df.Year.astype(str)
flight_df.Month = flight_df.Month.astype(str)
flight_df.DayofMonth = flight_df.DayofMonth.astype(str)
flight_df.DayOfWeek = flight_df.DayOfWeek.astype(str)
#Check the data types to confirm changes
flight_df.dtypes
Year object Month object DayofMonth object DayOfWeek object DepTime float64 CRSDepTime int64 ArrTime float64 CRSArrTime int64 UniqueCarrier object FlightNum int64 TailNum object ActualElapsedTime float64 CRSElapsedTime float64 AirTime float64 ArrDelay float64 DepDelay float64 Origin object Dest object Distance int64 TaxiIn float64 TaxiOut float64 Cancelled int64 CancellationCode object Diverted int64 CarrierDelay float64 WeatherDelay float64 NASDelay float64 SecurityDelay float64 LateAircraftDelay float64 dtype: object
#Create the date column
flight_df['Date'] = flight_df.Year + '-' + flight_df.Month + '-' + flight_df.DayofMonth
flight_df.Date.head()
0 2003-1-29 1 2003-1-30 2 2003-1-31 3 2003-1-1 4 2003-1-2 Name: Date, dtype: object
flight_df.head()
| Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 1 | 29 | 3 | 1651.0 | 1655 | 1912.0 | 1913 | UA | 1017 | ... | 17.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 2003-1-29 |
| 1 | 2003 | 1 | 30 | 4 | 1654.0 | 1655 | 1910.0 | 1913 | UA | 1017 | ... | 26.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 2003-1-30 |
| 2 | 2003 | 1 | 31 | 5 | 1724.0 | 1655 | 1936.0 | 1913 | UA | 1017 | ... | 17.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 2003-1-31 |
| 3 | 2003 | 1 | 1 | 3 | 1033.0 | 1035 | 1625.0 | 1634 | UA | 1018 | ... | 11.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 2003-1-1 |
| 4 | 2003 | 1 | 2 | 4 | 1053.0 | 1035 | 1726.0 | 1634 | UA | 1018 | ... | 46.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN | 2003-1-2 |
5 rows × 30 columns
#Convert date column to datetime
flight_df['Date'] = pd.to_datetime(flight_df['Date'])
# Move the date column to be the first column
date_column = flight_df.pop('Date')
# insert column with insert(location, column_name, column_value)
flight_df.insert(0, 'Date', date_column)
flight_df.head()
| Date | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003-01-29 | 2003 | 1 | 29 | 3 | 1651.0 | 1655 | 1912.0 | 1913 | UA | ... | 5.0 | 17.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 2003-01-30 | 2003 | 1 | 30 | 4 | 1654.0 | 1655 | 1910.0 | 1913 | UA | ... | 2.0 | 26.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2003-01-31 | 2003 | 1 | 31 | 5 | 1724.0 | 1655 | 1936.0 | 1913 | UA | ... | 5.0 | 17.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 2003-01-01 | 2003 | 1 | 1 | 3 | 1033.0 | 1035 | 1625.0 | 1634 | UA | ... | 6.0 | 11.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 2003-01-02 | 2003 | 1 | 2 | 4 | 1053.0 | 1035 | 1726.0 | 1634 | UA | ... | 13.0 | 46.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 30 columns
#verify the date data type
flight_df.dtypes
Date datetime64[ns] Year object Month object DayofMonth object DayOfWeek object DepTime float64 CRSDepTime int64 ArrTime float64 CRSArrTime int64 UniqueCarrier object FlightNum int64 TailNum object ActualElapsedTime float64 CRSElapsedTime float64 AirTime float64 ArrDelay float64 DepDelay float64 Origin object Dest object Distance int64 TaxiIn float64 TaxiOut float64 Cancelled int64 CancellationCode object Diverted int64 CarrierDelay float64 WeatherDelay float64 NASDelay float64 SecurityDelay float64 LateAircraftDelay float64 dtype: object
#Change the values in the month column from numbers to actual months
flight_df['Month'] = flight_df['Month'].replace(
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12'],
['January', 'February', 'March', 'April', 'May', 'June', 'July',
'August', 'September', 'October', 'November', 'December']
)
#Change the values in the day of week column from numbers to actual week days
flight_df['DayOfWeek'] = flight_df['DayOfWeek'].replace(['1', '2', '3', '4', '5', '6', '7'],
['Monday', 'Tuesday', 'Wednesday', 'Thursday',
'Friday', 'Saturday', 'Sunday']
)
flight_df.head()
| Date | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003-01-29 | 2003 | January | 29 | Wednesday | 1651.0 | 1655 | 1912.0 | 1913 | UA | ... | 5.0 | 17.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 2003-01-30 | 2003 | January | 30 | Thursday | 1654.0 | 1655 | 1910.0 | 1913 | UA | ... | 2.0 | 26.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2003-01-31 | 2003 | January | 31 | Friday | 1724.0 | 1655 | 1936.0 | 1913 | UA | ... | 5.0 | 17.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 2003-01-01 | 2003 | January | 1 | Wednesday | 1033.0 | 1035 | 1625.0 | 1634 | UA | ... | 6.0 | 11.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 2003-01-02 | 2003 | January | 2 | Thursday | 1053.0 | 1035 | 1726.0 | 1634 | UA | ... | 13.0 | 46.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 30 columns
flight_df.tail()
| Date | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37742755 | 2008-04-17 | 2008 | April | 17 | Thursday | 1025.0 | 1025 | 1234.0 | 1237 | DL | ... | 5.0 | 16.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 37742756 | 2008-04-17 | 2008 | April | 17 | Thursday | 1319.0 | 1320 | 1527.0 | 1524 | DL | ... | 9.0 | 12.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 37742757 | 2008-04-17 | 2008 | April | 17 | Thursday | 1335.0 | 1335 | 1556.0 | 1553 | DL | ... | 7.0 | 31.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 37742758 | 2008-04-17 | 2008 | April | 17 | Thursday | 1933.0 | 1935 | 2140.0 | 2141 | DL | ... | 9.0 | 12.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 37742759 | 2008-04-17 | 2008 | April | 17 | Thursday | 621.0 | 615 | 752.0 | 754 | DL | ... | 15.0 | 12.0 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 30 columns
#Change these columns to categorical
cols = ['Year', 'Month', 'DayOfWeek', 'Cancelled', 'Diverted', 'CancellationCode']
flight_df[cols] = flight_df[cols].astype('category')
flight_df.dtypes
Date datetime64[ns] Year category Month category DayofMonth object DayOfWeek category DepTime float64 CRSDepTime int64 ArrTime float64 CRSArrTime int64 UniqueCarrier object FlightNum int64 TailNum object ActualElapsedTime float64 CRSElapsedTime float64 AirTime float64 ArrDelay float64 DepDelay float64 Origin object Dest object Distance int64 TaxiIn float64 TaxiOut float64 Cancelled category CancellationCode category Diverted category CarrierDelay float64 WeatherDelay float64 NASDelay float64 SecurityDelay float64 LateAircraftDelay float64 dtype: object
#Convert all columns with float data types to integer
#first fill the cells that contain NA with 0
cols2 = ['DepTime', 'ArrTime', 'ActualElapsedTime', 'AirTime', 'CRSElapsedTime', 'ArrDelay', 'DepDelay', 'TaxiIn',
'TaxiOut', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
flight_df[cols2] = flight_df[cols2].fillna(0)
#Convert the columns with float data types to integers for sake of space conservation
flight_df[cols2] = flight_df[cols2].astype(int)
flight_df.dtypes
Date datetime64[ns] Year category Month category DayofMonth object DayOfWeek category DepTime int32 CRSDepTime int64 ArrTime int32 CRSArrTime int64 UniqueCarrier object FlightNum int64 TailNum object ActualElapsedTime int32 CRSElapsedTime int32 AirTime int32 ArrDelay int32 DepDelay int32 Origin object Dest object Distance int64 TaxiIn int32 TaxiOut int32 Cancelled category CancellationCode category Diverted category CarrierDelay int32 WeatherDelay int32 NASDelay int32 SecurityDelay int32 LateAircraftDelay int32 dtype: object
flight_df.head()
| Date | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003-01-29 | 2003 | January | 29 | Wednesday | 1651 | 1655 | 1912 | 1913 | UA | ... | 5 | 17 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2003-01-30 | 2003 | January | 30 | Thursday | 1654 | 1655 | 1910 | 1913 | UA | ... | 2 | 26 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2003-01-31 | 2003 | January | 31 | Friday | 1724 | 1655 | 1936 | 1913 | UA | ... | 5 | 17 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2003-01-01 | 2003 | January | 1 | Wednesday | 1033 | 1035 | 1625 | 1634 | UA | ... | 6 | 11 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2003-01-02 | 2003 | January | 2 | Thursday | 1053 | 1035 | 1726 | 1634 | UA | ... | 13 | 46 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 30 columns
#Check statistics description of the data
flight_df.describe(datetime_is_numeric=True)
| Date | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37742760 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 | 3.774276e+07 |
| mean | 2005-09-20 05:13:02.050533888 | 1.317565e+03 | 1.336018e+03 | 1.460449e+03 | 1.500471e+03 | 2.113649e+03 | 1.222718e+02 | 1.259891e+02 | 1.006900e+02 | 7.350166e+00 | 8.746864e+00 | 7.198902e+02 | 7.204859e+00 | 1.570925e+01 | 2.944031e+00 | 6.371962e-01 | 3.275821e+00 | 2.242123e-02 | 3.796223e+00 |
| min | 2003-01-01 00:00:00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | -7.100000e+02 | -1.240000e+03 | -3.818000e+03 | -1.302000e+03 | -1.410000e+03 | 6.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -6.000000e+01 | 0.000000e+00 | 0.000000e+00 |
| 25% | 2004-06-02 00:00:00 | 9.200000e+02 | 9.300000e+02 | 1.059000e+03 | 1.120000e+03 | 5.870000e+02 | 7.200000e+01 | 7.500000e+01 | 5.200000e+01 | -9.000000e+00 | -4.000000e+00 | 3.130000e+02 | 4.000000e+00 | 1.000000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 50% | 2005-09-22 00:00:00 | 1.322000e+03 | 1.325000e+03 | 1.508000e+03 | 1.522000e+03 | 1.457000e+03 | 1.040000e+02 | 1.070000e+02 | 8.300000e+01 | -1.000000e+00 | 0.000000e+00 | 5.580000e+02 | 5.000000e+00 | 1.300000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 75% | 2007-01-21 00:00:00 | 1.729000e+03 | 1.724000e+03 | 1.908000e+03 | 1.910000e+03 | 3.220000e+03 | 1.540000e+02 | 1.550000e+02 | 1.320000e+02 | 1.100000e+01 | 7.000000e+00 | 9.480000e+02 | 8.000000e+00 | 1.800000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| max | 2008-04-30 00:00:00 | 2.930000e+03 | 2.400000e+03 | 2.955000e+03 | 2.400000e+03 | 9.912000e+03 | 1.879000e+03 | 1.441000e+03 | 3.508000e+03 | 2.598000e+03 | 2.601000e+03 | 4.962000e+03 | 1.523000e+03 | 3.905000e+03 | 2.580000e+03 | 1.510000e+03 | 1.392000e+03 | 5.330000e+02 | 1.407000e+03 |
| std | NaN | 5.046890e+02 | 4.626217e+02 | 5.371513e+02 | 4.780545e+02 | 1.942240e+03 | 7.222935e+01 | 6.977545e+01 | 7.821468e+01 | 3.501965e+01 | 3.180393e+01 | 5.681203e+02 | 3.447001e+01 | 1.205153e+01 | 1.773918e+01 | 8.420635e+00 | 1.481922e+01 | 1.096441e+00 | 1.807507e+01 |
The summary statistics indicate that there are departure delays that were as high as 2598mins (approx. 44hours) and flights that departed as much as 1410mins (approx. 24hours) earlier than scheduled.
These are considered outliers and we would accept a maximum of 10hours delay and 1hour early departures
#Check for flights that have departure delays greater than 10hrs (600mins)
flight_df[flight_df['DepDelay']>=600]
| Date | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20417 | 2003-01-04 | 2003 | January | 4 | Saturday | 915 | 1930 | 1143 | 2200 | UA | ... | 5 | 33 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 143166 | 2003-01-07 | 2003 | January | 7 | Tuesday | 1854 | 615 | 2019 | 745 | MQ | ... | 7 | 18 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 151706 | 2003-01-03 | 2003 | January | 3 | Friday | 706 | 1716 | 910 | 1858 | NW | ... | 15 | 25 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 156105 | 2003-01-31 | 2003 | January | 31 | Friday | 816 | 1300 | 1220 | 1706 | NW | ... | 6 | 11 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 157883 | 2003-01-02 | 2003 | January | 2 | Thursday | 601 | 1355 | 918 | 1708 | NW | ... | 14 | 16 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37699372 | 2008-04-13 | 2008 | April | 13 | Sunday | 1642 | 630 | 2156 | 1145 | CO | ... | 6 | 11 | 0 | NaN | 0 | 611 | 0 | 0 | 0 | 0 |
| 37704761 | 2008-04-21 | 2008 | April | 21 | Monday | 2115 | 900 | 2228 | 1018 | CO | ... | 8 | 9 | 0 | NaN | 0 | 730 | 0 | 0 | 0 | 0 |
| 37715222 | 2008-04-03 | 2008 | April | 3 | Thursday | 1946 | 910 | 44 | 1422 | CO | ... | 8 | 13 | 0 | NaN | 0 | 622 | 0 | 0 | 0 | 0 |
| 37720671 | 2008-04-23 | 2008 | April | 23 | Wednesday | 2315 | 1125 | 632 | 1917 | CO | ... | 10 | 9 | 0 | NaN | 0 | 675 | 0 | 0 | 0 | 0 |
| 37734861 | 2008-04-11 | 2008 | April | 11 | Friday | 2336 | 1140 | 719 | 2012 | DL | ... | 5 | 11 | 0 | NaN | 0 | 667 | 0 | 0 | 0 | 0 |
4401 rows × 30 columns
#Check for flights that have actual departure earlier than schedule by 2hrs (120mins)
flight_df[flight_df['DepDelay']<=-60]
| Date | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 71636 | 2003-01-07 | 2003 | January | 7 | Tuesday | 600 | 530 | 655 | 625 | WN | ... | 3 | 15 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 487694 | 2003-01-31 | 2003 | January | 31 | Friday | 2135 | 700 | 2329 | 903 | AS | ... | 6 | 11 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 788981 | 2003-02-05 | 2003 | February | 5 | Wednesday | 1310 | 1420 | 1721 | 1728 | UA | ... | 5 | 84 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 870919 | 2003-02-17 | 2003 | February | 17 | Monday | 945 | 1050 | 1047 | 1155 | EV | ... | 2 | 10 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 999449 | 2003-02-10 | 2003 | February | 10 | Monday | 50 | 1910 | 150 | 2025 | B6 | ... | 8 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 36462698 | 2008-02-09 | 2008 | February | 9 | Saturday | 1645 | 1804 | 2005 | 2058 | AS | ... | 4 | 30 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 36697577 | 2008-03-31 | 2008 | March | 31 | Monday | 1945 | 2055 | 2130 | 2155 | OH | ... | 15 | 28 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 36699720 | 2008-03-02 | 2008 | March | 2 | Sunday | 838 | 945 | 921 | 925 | OO | ... | 4 | 10 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 36893442 | 2008-03-10 | 2008 | March | 10 | Monday | 1411 | 1513 | 2046 | 2045 | FL | ... | 15 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 37082118 | 2008-03-08 | 2008 | March | 8 | Saturday | 1550 | 1700 | 1803 | 1820 | B6 | ... | 4 | 71 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
570 rows × 30 columns
#Delete rows greater than 600mins departure delays
flight_df = flight_df[flight_df['DepDelay']<=600]
#Delete rows with flights departure 60mins earlier than schedule
flight_df = flight_df[flight_df['DepDelay']>=-60]
flight_df.shape
(37737887, 30)
#set the correct categorical order of the months and weekdays
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
category_day = pd.api.types.CategoricalDtype(categories=weekday, ordered=True)
category_month = pd.api.types.CategoricalDtype(categories=months, ordered=True)
flight_df['DayOfWeek'] = flight_df['DayOfWeek'].astype(category_day)
flight_df['Month'] = flight_df['Month'].astype(category_month)
flight_df.dtypes
Date datetime64[ns] Year category Month category DayofMonth object DayOfWeek category DepTime int32 CRSDepTime int64 ArrTime int32 CRSArrTime int64 UniqueCarrier object FlightNum int64 TailNum object ActualElapsedTime int32 CRSElapsedTime int32 AirTime int32 ArrDelay int32 DepDelay int32 Origin object Dest object Distance int64 TaxiIn int32 TaxiOut int32 Cancelled category CancellationCode category Diverted category CarrierDelay int32 WeatherDelay int32 NASDelay int32 SecurityDelay int32 LateAircraftDelay int32 dtype: object
#Save the cleaned data
flight_df.to_csv('flights_2003_2008_wrangled_rev.csv', index=False)
flight_df.head(30)
| Date | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003-01-29 | 2003 | January | 29 | Wednesday | 1651 | 1655 | 1912 | 1913 | UA | ... | 5 | 17 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2003-01-30 | 2003 | January | 30 | Thursday | 1654 | 1655 | 1910 | 1913 | UA | ... | 2 | 26 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2003-01-31 | 2003 | January | 31 | Friday | 1724 | 1655 | 1936 | 1913 | UA | ... | 5 | 17 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2003-01-01 | 2003 | January | 1 | Wednesday | 1033 | 1035 | 1625 | 1634 | UA | ... | 6 | 11 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2003-01-02 | 2003 | January | 2 | Thursday | 1053 | 1035 | 1726 | 1634 | UA | ... | 13 | 46 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2003-01-03 | 2003 | January | 3 | Friday | 1031 | 1035 | 1640 | 1634 | UA | ... | 13 | 13 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 2003-01-04 | 2003 | January | 4 | Saturday | 1031 | 1035 | 1626 | 1634 | UA | ... | 5 | 11 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 2003-01-05 | 2003 | January | 5 | Sunday | 1035 | 1035 | 1636 | 1634 | UA | ... | 5 | 9 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 2003-01-06 | 2003 | January | 6 | Monday | 1031 | 1035 | 1653 | 1634 | UA | ... | 7 | 14 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 2003-01-01 | 2003 | January | 1 | Wednesday | 1713 | 1710 | 1851 | 1847 | UA | ... | 7 | 29 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | 2003-01-02 | 2003 | January | 2 | Thursday | 1706 | 1710 | 1824 | 1847 | UA | ... | 7 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 2003-01-03 | 2003 | January | 3 | Friday | 1707 | 1710 | 1828 | 1847 | UA | ... | 8 | 15 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 2003-01-04 | 2003 | January | 4 | Saturday | 1710 | 1710 | 1835 | 1847 | UA | ... | 8 | 16 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 2003-01-05 | 2003 | January | 5 | Sunday | 1832 | 1710 | 1951 | 1847 | UA | ... | 7 | 15 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 2003-01-06 | 2003 | January | 6 | Monday | 1710 | 1710 | 1843 | 1847 | UA | ... | 13 | 22 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 2003-01-07 | 2003 | January | 7 | Tuesday | 1712 | 1710 | 1839 | 1847 | UA | ... | 4 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 16 | 2003-01-08 | 2003 | January | 8 | Wednesday | 1706 | 1710 | 1826 | 1847 | UA | ... | 4 | 11 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 2003-01-09 | 2003 | January | 9 | Thursday | 1706 | 1710 | 1820 | 1847 | UA | ... | 6 | 13 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 18 | 2003-01-10 | 2003 | January | 10 | Friday | 1707 | 1710 | 1831 | 1847 | UA | ... | 6 | 22 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | 2003-01-11 | 2003 | January | 11 | Saturday | 1708 | 1710 | 1831 | 1847 | UA | ... | 5 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | 2003-01-12 | 2003 | January | 12 | Sunday | 1704 | 1710 | 1823 | 1847 | UA | ... | 4 | 15 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 21 | 2003-01-13 | 2003 | January | 13 | Monday | 1714 | 1710 | 1835 | 1847 | UA | ... | 7 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 22 | 2003-01-14 | 2003 | January | 14 | Tuesday | 1707 | 1710 | 1836 | 1847 | UA | ... | 7 | 14 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 23 | 2003-01-15 | 2003 | January | 15 | Wednesday | 1709 | 1710 | 1838 | 1847 | UA | ... | 11 | 14 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 24 | 2003-01-16 | 2003 | January | 16 | Thursday | 1711 | 1710 | 1837 | 1847 | UA | ... | 6 | 16 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 25 | 2003-01-17 | 2003 | January | 17 | Friday | 1705 | 1710 | 1837 | 1847 | UA | ... | 10 | 21 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 26 | 2003-01-18 | 2003 | January | 18 | Saturday | 1706 | 1710 | 1824 | 1847 | UA | ... | 6 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | 2003-01-19 | 2003 | January | 19 | Sunday | 1702 | 1710 | 1834 | 1847 | UA | ... | 9 | 17 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 28 | 2003-01-20 | 2003 | January | 20 | Monday | 1709 | 1710 | 1822 | 1847 | UA | ... | 5 | 11 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 29 | 2003-01-21 | 2003 | January | 21 | Tuesday | 1705 | 1710 | 1827 | 1847 | UA | ... | 3 | 17 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
30 rows × 30 columns
flight_df.reset_index(drop=True)
| Date | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003-01-29 | 2003 | January | 29 | Wednesday | 1651 | 1655 | 1912 | 1913 | UA | ... | 5 | 17 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2003-01-30 | 2003 | January | 30 | Thursday | 1654 | 1655 | 1910 | 1913 | UA | ... | 2 | 26 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2003-01-31 | 2003 | January | 31 | Friday | 1724 | 1655 | 1936 | 1913 | UA | ... | 5 | 17 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2003-01-01 | 2003 | January | 1 | Wednesday | 1033 | 1035 | 1625 | 1634 | UA | ... | 6 | 11 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2003-01-02 | 2003 | January | 2 | Thursday | 1053 | 1035 | 1726 | 1634 | UA | ... | 13 | 46 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37737882 | 2008-04-17 | 2008 | April | 17 | Thursday | 1025 | 1025 | 1234 | 1237 | DL | ... | 5 | 16 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 37737883 | 2008-04-17 | 2008 | April | 17 | Thursday | 1319 | 1320 | 1527 | 1524 | DL | ... | 9 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 37737884 | 2008-04-17 | 2008 | April | 17 | Thursday | 1335 | 1335 | 1556 | 1553 | DL | ... | 7 | 31 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 37737885 | 2008-04-17 | 2008 | April | 17 | Thursday | 1933 | 1935 | 2140 | 2141 | DL | ... | 9 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
| 37737886 | 2008-04-17 | 2008 | April | 17 | Thursday | 621 | 615 | 752 | 754 | DL | ... | 15 | 12 | 0 | NaN | 0 | 0 | 0 | 0 | 0 | 0 |
37737887 rows × 30 columns
#Due to the enormous size of this data, we would be taking a sample to carry out a histogram overview of the numerical data
flight_sample = flight_df.sample(n=50000)
#Plot an histogram of all numerical data
flight_sample.hist(figsize=(16,16));
flight_df.dtypes
Date datetime64[ns] Year category Month category DayofMonth object DayOfWeek category DepTime int32 CRSDepTime int64 ArrTime int32 CRSArrTime int64 UniqueCarrier object FlightNum int64 TailNum object ActualElapsedTime int32 CRSElapsedTime int32 AirTime int32 ArrDelay int32 DepDelay int32 Origin object Dest object Distance int64 TaxiIn int32 TaxiOut int32 Cancelled category CancellationCode category Diverted category CarrierDelay int32 WeatherDelay int32 NASDelay int32 SecurityDelay int32 LateAircraftDelay int32 dtype: object
#Plot a histogram to see the data spread for the duration of time departure flights are delayed
plt.figure(figsize=[13, 8])
binsize = 0.5
bins = np.arange(-20, flight_df['DepDelay'].mean()+60, binsize)
plt.hist(data = flight_df, x = 'DepDelay', bins = bins);
plt.title('Histogram of Departure Delays', fontsize=14, y=1.015)
plt.xlabel('Flight Departure Delays in minutes', labelpad=6, fontsize=12)
plt.ylabel('Delays Count', labelpad=6, fontsize=12)
plt.show()
#Plot a histogram to see the data spread for the amount of time arrival flights are delayed
plt.figure(figsize=[13, 8])
binsize = 0.5
bins = np.arange(-50, flight_df['ArrDelay'].mean()+60, binsize)
plt.hist(data = flight_df, x = 'ArrDelay', bins = bins);
plt.title('Histogram of Arrival Delays', fontsize=14, y=1.015)
plt.xlabel('Flight Arrival Delays in minutes', labelpad=6, fontsize=12)
plt.ylabel('Delays Count', labelpad=6, fontsize=12)
plt.show()
The distribution showed that many flights departed right on time with this making approximately 17% of the flights investigated while only approximately 7% of the flights investigated arrived exactly on time.
The distribution also showed that there were flights that departed and arrived earlier than schedule but there were more flights that experienced delays.
#Perform a count of the flights per year
rcParams['figure.figsize'] = 8,6
sns.countplot(data = flight_df, x = 'Year', color = sns.color_palette()[0]);
plt.title('Flight Counts Per Year', fontsize=18, y=1.015)
plt.xlabel('Year', labelpad=6, fontsize=12)
plt.ylabel('Flight Counts', labelpad=6, fontsize=12);
#Perform a count of the flights per month
rcParams['figure.figsize'] = 12,6
sns.countplot(data = flight_df, x = 'Month', color = sns.color_palette()[0]);
plt.title('Flight Counts Per Month', fontsize=18, y=1.015)
plt.xlabel('Month', labelpad=6, fontsize=12)
plt.ylabel('Flight Counts', labelpad=6, fontsize=12);
The absence of flight data from May 2008 to December 2008 makes it difficult to determine the correct monthly distribution of the studied data
#Perform a count of the flights per day
rcParams['figure.figsize'] = 10,6
sns.countplot(data = flight_df, x = 'DayOfWeek', color = sns.color_palette()[0]);
plt.title('Flight Counts Per Day', fontsize=18, y=1.015)
plt.xlabel('Day', labelpad=6, fontsize=12)
plt.ylabel('Flight Counts', labelpad=6, fontsize=12);
From the above graph, it appears the flight distribution during the weekdays is fairly constant but there is a slight deep in the weekend, especially on Saturdays
flight_sample.dtypes
Date datetime64[ns] Year category Month category DayofMonth object DayOfWeek category DepTime int32 CRSDepTime int64 ArrTime int32 CRSArrTime int64 UniqueCarrier object FlightNum int64 TailNum object ActualElapsedTime int32 CRSElapsedTime int32 AirTime int32 ArrDelay int32 DepDelay int32 Origin object Dest object Distance int64 TaxiIn int32 TaxiOut int32 Cancelled category CancellationCode category Diverted category CarrierDelay int32 WeatherDelay int32 NASDelay int32 SecurityDelay int32 LateAircraftDelay int32 dtype: object
#ploting a scattered plot for all the numeric data using the sample dataframe of 50,000 rows
pd.plotting.scatter_matrix(flight_sample, figsize=(36,26));
#ploting a scattered plot for some of the numeric data using the sample dataframe of 50,000 rows
numeric_vars = ['ActualElapsedTime', 'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Distance', 'CarrierDelay',
'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay']
g = sns.PairGrid(data = flight_sample, vars = numeric_vars);
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter)
plt.title('Scatter Plot of Numeric Data', fontsize=14, y=1.025);
# correlation plot of a sample of the data
plt.figure(figsize = [12, 8])
sns.heatmap(flight_sample[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.title('Correlation Matrix of Data Sample as a Heat Map', fontsize=16, y=1.025)
plt.show()
# correlation plot of entire data
plt.figure(figsize = [12, 8])
sns.heatmap(flight_df[numeric_vars].corr(), annot = True, fmt = '.3f',
cmap = 'vlag_r', center = 0)
plt.title('Correlation Matrix of Entire Data as a Heat Map', fontsize=16, y=1.025)
plt.show()
It can be seen that there is a correlation between the departure delays flights experience and the arrival delays of those flights. Additionally, there is a relationship between the amount of time spent on air to the distance of flight. It can aslo be seen that Late aircraft, the airline and National Airspace systems are major contributing factors to delays
plt.figure(figsize = [12, 8])
plt.scatter(data = flight_df, x ='DepDelay' , y = 'ArrDelay', alpha = 1/10);
plt.title('Scatter Plot of Flight Departure Delays against Arrival Delays', fontsize=20, y=1.015)
plt.xlabel('Departure Delays \n (mins)', labelpad=10, fontsize=16);
plt.ylabel('Arrival Delays \n (mins)', labelpad=10, fontsize=16);
Delayed departure causes delayed arrival for the flights and on time departure results to on time arrival for the flights.
#Create a dataframe of the average delays per month
df_months = flight_df.groupby('Month').mean().reset_index()
df_months
| Month | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | January | 1310.219509 | 1337.594929 | 1455.805386 | 1504.062797 | 2157.704147 | 121.568725 | 126.094244 | 100.288053 | 7.277132 | 8.636745 | 711.348029 | 7.040496 | 15.638199 | 2.786213 | 0.673578 | 3.145929 | 0.017407 | 3.525414 |
| 1 | February | 1307.053541 | 1337.675643 | 1450.394634 | 1503.640999 | 2161.575870 | 121.705141 | 126.450119 | 100.546064 | 8.366226 | 9.421717 | 714.370631 | 7.010972 | 15.621919 | 2.840237 | 0.660985 | 3.196059 | 0.020346 | 3.807352 |
| 2 | March | 1319.751260 | 1337.734161 | 1462.418632 | 1502.017882 | 2150.603762 | 122.994410 | 127.033846 | 101.979353 | 7.033415 | 8.755790 | 719.825833 | 6.930663 | 15.512858 | 2.710515 | 0.485819 | 2.982878 | 0.021594 | 3.665716 |
| 3 | April | 1321.727732 | 1334.745216 | 1469.020260 | 1499.120422 | 2139.841921 | 122.200705 | 125.881644 | 101.464495 | 4.410187 | 6.441712 | 718.267674 | 6.958851 | 15.212726 | 2.281117 | 0.363707 | 2.483381 | 0.018092 | 2.870917 |
| 4 | May | 1325.424717 | 1336.303487 | 1470.902731 | 1500.690300 | 2114.466694 | 121.964057 | 125.027566 | 101.101215 | 5.216839 | 6.827719 | 717.670137 | 7.129669 | 15.476957 | 2.152643 | 0.493377 | 2.699389 | 0.011184 | 2.910383 |
| 5 | June | 1321.671332 | 1337.348605 | 1454.380415 | 1498.229627 | 2097.977341 | 122.908195 | 125.784430 | 100.872881 | 10.478492 | 11.207674 | 726.576325 | 7.700359 | 16.184087 | 3.511726 | 0.964118 | 4.104409 | 0.026997 | 4.956243 |
| 6 | July | 1320.941525 | 1337.609394 | 1451.448223 | 1498.297337 | 2087.331382 | 122.888406 | 126.133281 | 100.662036 | 10.767825 | 11.725772 | 730.249831 | 7.841367 | 16.275591 | 3.674984 | 0.914263 | 4.097322 | 0.023155 | 5.257432 |
| 7 | August | 1320.410830 | 1337.611761 | 1455.786996 | 1498.714731 | 2075.003639 | 122.426184 | 125.768085 | 99.634940 | 8.620705 | 9.716547 | 726.769566 | 7.581895 | 16.119414 | 3.311431 | 0.772425 | 3.564851 | 0.042475 | 4.427208 |
| 8 | September | 1311.737911 | 1333.325800 | 1462.251200 | 1499.824526 | 2095.867843 | 120.179950 | 124.470582 | 98.791670 | 3.266361 | 5.218247 | 715.610378 | 6.944332 | 15.362981 | 2.330280 | 0.426898 | 2.657896 | 0.020122 | 2.602176 |
| 9 | November | 1322.011575 | 1333.410356 | 1470.352219 | 1499.266528 | 2090.919534 | 122.999138 | 126.359708 | 101.165848 | 5.435225 | 7.272674 | 717.715234 | 7.142202 | 15.652659 | 2.585080 | 0.503677 | 3.319499 | 0.016979 | 3.232795 |
| 10 | December | 1310.393426 | 1334.602495 | 1450.961596 | 1500.206187 | 2074.595418 | 123.610752 | 127.829350 | 101.280289 | 11.038630 | 12.054924 | 726.512437 | 7.316818 | 15.975847 | 3.924238 | 0.831522 | 4.129881 | 0.033454 | 5.143179 |
def pointplot1(df, horizontal, vertical1, vertical2, color1, color2, title, xtitle, ytitle):
rcParams['figure.figsize'] = 19,8
fig, ax = plt.subplots()
sns.pointplot(data=df, x=horizontal, y=vertical1, color=color1);
sns.pointplot(data=df, x=horizontal, y=vertical2, color=color2);
plt.title(title, fontsize=20, y=1.015)
plt.xlabel(xtitle, labelpad=10, fontsize=16)
plt.ylabel(ytitle, labelpad=10, fontsize=16)
ax.set_ylim([0, 15])
DepDelay = mpatches.Patch(color='#bc5090', label='Departure Delay')
ArrDelay = mpatches.Patch(color='#003f5c', label='Arrival Delay')
plt.legend(handles=[DepDelay, ArrDelay]);
pointplot1(df_months, 'Month', 'DepDelay', 'ArrDelay', '#bc5090', '#003f5c',
'Average Flight Departure & Arrival Delays Per Months of the Year',
'Months of the Year', 'Departure & Arrival Delays \n (mins)')
def barplot1(df, horizontal, vertical1, vertical2, color1, color2, title, xtitle, ytitle):
#rcParams['figure.figsize'] = 15,8
fig, ax = plt.subplots()
sns.barplot(data=df, x=horizontal, y=vertical1, color=color1);
sns.barplot(data=df, x=horizontal, y=vertical2, color=color2);
plt.title(title, fontsize=20, y=1.015)
plt.xlabel(xtitle, labelpad=10, fontsize=16)
plt.ylabel(ytitle, labelpad=10, fontsize=16)
DepDelay = mpatches.Patch(color='#bc5090', label='Departure Delay')
ArrDelay = mpatches.Patch(color='#003f5c', label='Arrival Delay')
plt.legend(handles=[DepDelay, ArrDelay]);
rcParams['figure.figsize'] = 15,8
barplot1(df_months, 'Month', 'DepDelay', 'ArrDelay', '#bc5090', '#003f5c',
'Average Flight Departure & Arrival Delays Per Months of the Year',
'Months of the Year', 'Departure & Arrival Delays \n (mins)')
# Violin plot of delays over the various months
rcParams['figure.figsize'] = 14,7
base_color = sns.color_palette()[0]
sns.violinplot(data=flight_df, x='Month', y='DepDelay', color = base_color, inner='quartile')
ticks = [0, 1]
#labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'November', 'December']
#plt.xticks(ticks, labels)
plt.title('Violin plot Flight Departure & Arrival Delays Per Months of the Year', fontsize=14, y=1.025)
#plt.xlabel('Months')
plt.ylabel('Departure & Arrival Delays \n (mins)');
The above gives a clear indication that December, July and June are the months with the highest average flight delays and this can be likely be attributed to the summer and christmas holidays within those months.
Also, September, April and May are the months with the least flight delays and this could be as a result of less activities due to weather impacts. All of these are my assumptions
df_weekdays = flight_df.groupby('DayOfWeek').mean().reset_index()
df_weekdays
| DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Monday | 1317.022861 | 1335.340643 | 1459.592118 | 1499.298769 | 2113.347109 | 121.573485 | 125.283221 | 99.933882 | 7.813732 | 9.173509 | 713.874802 | 7.226522 | 15.699925 | 2.902168 | 0.674118 | 3.416074 | 0.020537 | 4.049500 |
| 1 | Tuesday | 1314.316936 | 1335.926443 | 1459.239308 | 1500.759023 | 2109.570694 | 121.262736 | 124.978091 | 99.812517 | 5.745264 | 7.057494 | 711.473489 | 7.076416 | 15.677604 | 2.494479 | 0.574968 | 3.235017 | 0.013905 | 3.194297 |
| 2 | Wednesday | 1315.236585 | 1335.795000 | 1458.084965 | 1500.123132 | 2115.242744 | 121.690585 | 125.041238 | 99.892141 | 6.976287 | 7.900793 | 711.766206 | 7.138367 | 15.981372 | 2.606637 | 0.626894 | 3.495381 | 0.018248 | 3.595516 |
| 3 | Thursday | 1314.522949 | 1334.249562 | 1453.301778 | 1497.411701 | 2114.347324 | 122.336226 | 125.299951 | 100.086957 | 9.242244 | 9.792044 | 713.944826 | 7.314588 | 16.262478 | 2.971074 | 0.709061 | 3.771901 | 0.026650 | 4.367126 |
| 4 | Friday | 1319.247423 | 1336.170666 | 1457.615095 | 1498.845689 | 2112.056353 | 122.105769 | 125.337222 | 100.113084 | 9.767841 | 10.602813 | 714.348522 | 7.249461 | 16.072420 | 3.210932 | 0.672066 | 3.703703 | 0.024578 | 4.625477 |
| 5 | Saturday | 1292.023914 | 1306.411565 | 1447.231150 | 1476.439194 | 2119.360796 | 124.213874 | 129.389332 | 103.935335 | 3.630900 | 6.761147 | 748.832116 | 7.002333 | 14.708930 | 2.863386 | 0.497155 | 2.094488 | 0.027162 | 2.690871 |
| 6 | Sunday | 1348.679978 | 1365.423096 | 1487.851765 | 1528.388815 | 2112.598322 | 122.940337 | 127.109798 | 101.491844 | 7.029750 | 9.027095 | 729.382387 | 7.392733 | 15.394691 | 3.081727 | 0.615432 | 2.993342 | 0.026648 | 3.813552 |
pointplot1(df_weekdays, 'DayOfWeek', 'DepDelay', 'ArrDelay', '#bc5090', '#003f5c',
'Average Flight Departure & Arrival Delays Per Days of the Week',
'Days of the Week', 'Departure & Arrival Delays \n (mins)')
rcParams['figure.figsize'] = 13,8
barplot1(df_weekdays, 'DayOfWeek', 'DepDelay', 'ArrDelay', '#bc5090', '#003f5c',
'Average Flight Departure & Arrival Delays Per Days of the Week',
'Days of the Week', 'Departure & Arrival Delays \n (mins)')
The investigation above showed that Friday, Thursday as well as Sunday and Monday are the days with the most average delayed flights. This could possibly be explained using the weekend factor that sees many people travelling for various reasons.
Saturday has the least average flight delays and this could be as a result of reduced activities at the airports because many people already flew into or out of town the previous day.
df_years = flight_df.groupby('Year').mean().reset_index()
df_years
| Year | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | 1323.691785 | 1341.203089 | 1478.229520 | 1510.034870 | 2013.337332 | 120.242552 | 123.783980 | 102.587140 | 3.466596 | 5.106792 | 711.407869 | 7.052982 | 15.085420 | 1.290189 | 0.300377 | 1.824745 | 0.012433 | 1.542409 |
| 1 | 2004 | 1322.454991 | 1340.776257 | 1466.957332 | 1503.653723 | 2091.448403 | 120.927751 | 124.448826 | 100.187119 | 6.300787 | 7.673023 | 713.575697 | 7.843915 | 15.722129 | 2.586037 | 0.696129 | 3.428941 | 0.025138 | 3.437274 |
| 2 | 2005 | 1319.391628 | 1337.953107 | 1460.601715 | 1499.828692 | 2042.720162 | 122.124151 | 125.901508 | 99.156914 | 6.938397 | 8.438173 | 723.711986 | 7.567385 | 15.432282 | 2.935100 | 0.649852 | 3.362485 | 0.019752 | 3.651908 |
| 3 | 2006 | 1317.530385 | 1332.236800 | 1458.198163 | 1495.759524 | 2186.829745 | 123.606857 | 127.180702 | 100.854114 | 8.423315 | 9.841164 | 727.966861 | 7.058027 | 15.744665 | 3.327100 | 0.671558 | 3.581597 | 0.030996 | 4.504022 |
| 4 | 2007 | 1310.362022 | 1330.585286 | 1446.758449 | 1495.386105 | 2188.111007 | 123.301806 | 127.229646 | 100.358097 | 9.809984 | 11.013889 | 719.776516 | 6.691520 | 16.299120 | 3.760127 | 0.755412 | 3.776973 | 0.023737 | 5.087145 |
| 5 | 2008 | 1303.885433 | 1329.992807 | 1442.274740 | 1496.749184 | 2213.874016 | 124.841941 | 129.353448 | 102.016128 | 9.843099 | 10.998441 | 726.141456 | 6.638737 | 16.234712 | 3.756546 | 0.707930 | 3.944971 | 0.019690 | 5.121355 |
pointplot1(df_years, 'Year', 'DepDelay', 'ArrDelay', '#bc5090', '#003f5c',
'Average Flight Departure & Arrival Delays Per Year', 'Year', 'Departure & Arrival Delays \n (mins)')
rcParams['figure.figsize'] = 14,8
barplot1(df_years, 'Year', 'DepDelay', 'ArrDelay', '#bc5090', '#003f5c',
'Average Flight Departure & Arrival Delays Per Year', 'Year', 'Departure & Arrival Delays \n (mins)')
The trends above show a steady increase in the average flight delays from 2003 to 2008. This may require more deepdive to determine the likely reason
#months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'November', 'December']
df_years_month = flight_df.groupby(['Year', 'Month']).mean().reset_index()
df_years_month
| Year | Month | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | January | 1327.133358 | 1344.194791 | 1483.070194 | 1513.841902 | 2043.127082 | 119.667812 | 123.395042 | 102.216741 | 2.092595 | 4.119530 | 695.316744 | 6.147034 | 14.903686 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2003 | February | 1296.773970 | 1344.173659 | 1446.312062 | 1514.659895 | 2040.196388 | 117.998576 | 123.772344 | 101.018114 | 5.690941 | 6.528476 | 699.189504 | 6.078334 | 14.894850 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2003 | March | 1324.698500 | 1344.886716 | 1478.563053 | 1514.565498 | 2033.316074 | 120.396157 | 124.463076 | 103.571767 | 2.394293 | 4.416930 | 705.059460 | 6.196080 | 14.709935 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2003 | April | 1326.008457 | 1341.398572 | 1484.908989 | 1510.968953 | 2013.157128 | 118.815218 | 123.522113 | 101.798875 | -0.439172 | 2.853296 | 704.612167 | 7.084952 | 14.144588 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2003 | May | 1332.006138 | 1341.129253 | 1489.791490 | 1509.864677 | 2002.987071 | 119.818021 | 122.994192 | 101.989649 | 1.548563 | 3.759505 | 705.119012 | 7.402149 | 14.674269 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61 | 2008 | July | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 62 | 2008 | August | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 63 | 2008 | September | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 64 | 2008 | November | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 65 | 2008 | December | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
66 rows × 20 columns
#Check the entries for 2008 to verify that the data available ended in April
df_years_month.tail(12)
| Year | Month | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54 | 2007 | December | 1297.895906 | 1329.828098 | 1429.330998 | 1495.515020 | 2185.809293 | 125.277251 | 129.574200 | 101.504975 | 15.413078 | 15.438539 | 727.391660 | 6.921275 | 16.904802 | 5.078789 | 1.094102 | 4.948578 | 0.029544 | 6.936167 |
| 55 | 2008 | January | 1302.991027 | 1330.348545 | 1445.305951 | 1499.847294 | 2229.791959 | 124.347006 | 129.014711 | 101.398557 | 9.765960 | 11.039004 | 720.248472 | 6.676052 | 16.315727 | 3.804765 | 0.694450 | 3.993681 | 0.021318 | 5.188598 |
| 56 | 2008 | February | 1295.064938 | 1330.898974 | 1429.242581 | 1498.168777 | 2230.493929 | 124.366052 | 129.498085 | 101.188426 | 12.407330 | 13.052686 | 724.641295 | 6.657716 | 16.581032 | 4.106717 | 0.934934 | 4.507529 | 0.019962 | 6.122494 |
| 57 | 2008 | March | 1306.837067 | 1331.167723 | 1441.518438 | 1495.877009 | 2204.019986 | 125.897477 | 130.494731 | 103.131987 | 10.728588 | 12.019356 | 733.776072 | 6.676674 | 16.140170 | 3.999607 | 0.766857 | 4.022944 | 0.024541 | 5.408519 |
| 58 | 2008 | April | 1310.145092 | 1327.560043 | 1452.385835 | 1493.158943 | 2192.086309 | 124.708882 | 128.383363 | 102.279939 | 6.568947 | 7.950929 | 725.673696 | 6.543812 | 15.920466 | 3.124127 | 0.444862 | 3.279974 | 0.012783 | 3.804753 |
| 59 | 2008 | May | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 60 | 2008 | June | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 61 | 2008 | July | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 62 | 2008 | August | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 63 | 2008 | September | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 64 | 2008 | November | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 65 | 2008 | December | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
def barplot2(df, horizontal, vertical, hue, color, title, xtitle, ytitle):
sns.barplot(data=df, x=horizontal, y=vertical, hue=hue, color=color)
plt.title(title, fontsize=20, y=1.015)
plt.xlabel(xtitle, labelpad=10, fontsize=16)
plt.ylabel(ytitle, labelpad=10, fontsize=16)
barplot2(df_years_month, 'Year', 'DepDelay', 'Month', '#003f5c',
'Average Flight Departure Delays Per Month Per Year', 'Year', 'Departure Delays \n (mins)')
The plot above supports the previous observation that December, July and June are the months with the highest average flight delays and while September, April and May are the months with the least flight delays as previously observed in the bivariate investigation.
df_month_weekdays = flight_df.groupby(['Month', 'DayOfWeek']).mean().reset_index()
df_month_weekdays
| Month | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | January | Monday | 1312.522275 | 1335.151763 | 1455.060273 | 1501.187315 | 2157.807629 | 121.201392 | 125.414601 | 99.438975 | 9.642931 | 10.801464 | 705.550253 | 7.234072 | 15.744496 | 3.177801 | 0.872818 | 3.752148 | 0.018876 | 4.624529 |
| 1 | January | Tuesday | 1311.351604 | 1337.278361 | 1460.441300 | 1504.487157 | 2161.800431 | 120.933284 | 125.228149 | 99.892607 | 5.711228 | 7.239972 | 703.299133 | 6.745026 | 15.500175 | 2.618715 | 0.530340 | 3.017550 | 0.014977 | 3.088806 |
| 2 | January | Wednesday | 1309.139727 | 1338.163093 | 1456.774019 | 1504.852021 | 2163.736680 | 120.677424 | 125.177639 | 99.618516 | 5.791967 | 7.078977 | 702.683492 | 6.734632 | 15.689416 | 2.416009 | 0.554702 | 2.951289 | 0.013284 | 2.913037 |
| 3 | January | Thursday | 1313.885378 | 1337.158752 | 1457.695695 | 1502.302194 | 2159.534274 | 121.867541 | 125.400508 | 100.248200 | 7.909734 | 8.667676 | 705.412955 | 7.093661 | 16.034743 | 2.605137 | 0.696039 | 3.270501 | 0.013325 | 3.566033 |
| 4 | January | Friday | 1314.406289 | 1337.024761 | 1458.127270 | 1501.666846 | 2151.745786 | 121.618588 | 125.237498 | 100.342829 | 8.362920 | 9.333234 | 704.927108 | 7.004493 | 15.850249 | 2.711208 | 0.636852 | 3.282728 | 0.016160 | 3.753287 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72 | December | Wednesday | 1310.534695 | 1333.252502 | 1453.870099 | 1499.038618 | 2076.562688 | 123.020510 | 126.739903 | 100.621205 | 10.192524 | 11.037072 | 717.881342 | 7.332553 | 16.112645 | 3.530519 | 0.755086 | 3.990656 | 0.032605 | 4.799107 |
| 73 | December | Thursday | 1303.667301 | 1336.660242 | 1436.284634 | 1500.461512 | 2078.118461 | 123.057604 | 127.159034 | 100.012518 | 13.675848 | 13.950743 | 720.645232 | 7.566818 | 16.329019 | 4.127453 | 0.973543 | 4.971436 | 0.031225 | 6.064957 |
| 74 | December | Friday | 1305.866877 | 1334.246714 | 1435.213456 | 1498.058896 | 2073.937070 | 122.787357 | 127.261190 | 99.990696 | 15.765402 | 16.131351 | 721.486573 | 7.335771 | 16.292121 | 4.663325 | 1.162994 | 5.042106 | 0.034069 | 7.159039 |
| 75 | December | Saturday | 1282.882541 | 1305.942654 | 1432.697393 | 1476.107576 | 2077.978503 | 125.592882 | 131.152329 | 104.104684 | 8.563061 | 10.860425 | 753.889927 | 7.130699 | 15.266735 | 4.147128 | 0.796740 | 3.126160 | 0.035468 | 4.323536 |
| 76 | December | Sunday | 1334.867267 | 1360.797848 | 1472.827970 | 1525.450698 | 2079.714743 | 124.492386 | 128.932392 | 102.013398 | 11.337410 | 12.400792 | 735.225211 | 7.476994 | 15.933598 | 4.108131 | 0.889739 | 4.262951 | 0.035950 | 5.238509 |
77 rows × 20 columns
barplot2(df_month_weekdays, 'Month', 'DepDelay', 'DayOfWeek', '#003f5c',
'Average Flight Departure Delays Per Month For Each Day', 'Month', 'Departure Delays \n (mins)')
rcParams['figure.figsize'] = 20,12
base_color = sns.color_palette()[0]
barplot2(df_month_weekdays, 'DayOfWeek', 'DepDelay', 'Month', base_color,
'Average Flight Departure Delays Per Month For Each Day', 'Day Of Week', 'Departure Delays \n (mins)')
The plots above supports the bivariate analysis that Friday, Thursday as well as Sunday and Monday are the days with the most average delayed flights while Saturday has the least average flight delays.
df_year_month_weekdays = flight_df.groupby(['Year', 'Month', 'DayOfWeek']).mean().reset_index()
df_year_month_weekdays
| Year | Month | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | ... | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | January | Monday | 1324.523482 | 1339.810419 | 1480.476478 | 1508.573875 | 2034.476356 | 118.663817 | 122.529006 | ... | 1.202237 | 3.508300 | 688.506895 | 6.112768 | 14.718978 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2003 | January | Tuesday | 1320.793013 | 1343.679564 | 1477.546852 | 1512.556635 | 2039.437242 | 117.902568 | 122.122709 | ... | -0.235883 | 2.166080 | 684.796522 | 5.962849 | 14.415956 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2003 | January | Wednesday | 1331.334714 | 1350.029425 | 1489.735749 | 1519.425824 | 2044.563407 | 119.408968 | 123.107438 | ... | 1.234275 | 3.203038 | 692.506031 | 6.012288 | 14.701373 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2003 | January | Thursday | 1322.210905 | 1341.950424 | 1475.073189 | 1510.348632 | 2037.089321 | 119.243082 | 122.588584 | ... | 4.137359 | 5.306873 | 688.953315 | 6.290221 | 15.248246 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2003 | January | Friday | 1325.458781 | 1341.883542 | 1479.796383 | 1510.452766 | 2038.135871 | 119.347702 | 122.642435 | ... | 3.689885 | 5.200977 | 689.454954 | 6.098920 | 15.289333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 457 | 2008 | December | Wednesday | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 458 | 2008 | December | Thursday | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 459 | 2008 | December | Friday | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 460 | 2008 | December | Saturday | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 461 | 2008 | December | Sunday | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
462 rows × 21 columns
All the above charts supports the origin position previously observed in the bivariate exploration
df_carrier = flight_df.groupby('UniqueCarrier').mean().reset_index().sort_values(['DepDelay'], ascending = False)
df_carrier
| UniqueCarrier | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | EV | 1319.912312 | 1342.454732 | 1458.766423 | 1481.179043 | 4478.586070 | 90.212502 | 96.144802 | 51.184388 | 10.472960 | 13.323556 | 452.395766 | 24.177069 | 14.880416 | 5.622441 | 2.296742 | 3.816344 | 0.019997 | 1.739058 |
| 22 | YV | 1317.866518 | 1363.032082 | 1403.371612 | 1469.609101 | 5282.880517 | 84.916192 | 90.203063 | 62.567974 | 11.063257 | 12.807473 | 399.345941 | 6.721133 | 15.664992 | 6.218178 | 0.556563 | 2.359594 | 0.029354 | 6.047495 |
| 4 | B6 | 1346.107355 | 1370.575493 | 1419.045681 | 1485.415330 | 386.829088 | 185.955246 | 190.677219 | 159.660069 | 8.974078 | 11.097054 | 1214.614417 | 6.023280 | 20.396082 | 2.407927 | 0.391695 | 5.375139 | 0.056120 | 5.757172 |
| 10 | FL | 1347.741878 | 1357.594007 | 1464.919415 | 1494.888471 | 493.315849 | 117.003662 | 119.253137 | 94.038883 | 9.549501 | 10.350610 | 656.409345 | 7.464555 | 15.545443 | 1.793082 | 0.130554 | 4.369197 | 0.000000 | 6.525950 |
| 20 | WN | 1353.465304 | 1352.858075 | 1483.975760 | 1508.005902 | 1316.310875 | 100.289288 | 105.952863 | 86.046146 | 5.021043 | 9.690001 | 601.619347 | 4.129367 | 10.112289 | 1.577399 | 0.321423 | 1.185418 | 0.028438 | 4.695665 |
| 18 | UA | 1310.904243 | 1326.388326 | 1461.776944 | 1507.185284 | 740.602490 | 164.015599 | 168.384593 | 140.257854 | 7.976176 | 9.463667 | 1074.936548 | 6.244798 | 17.554969 | 2.770120 | 0.316542 | 4.004421 | 0.004896 | 4.935282 |
| 1 | AA | 1296.083413 | 1315.069913 | 1484.112991 | 1534.450225 | 1293.790895 | 164.004914 | 168.530576 | 139.345676 | 8.448820 | 9.380361 | 1070.407056 | 8.250518 | 16.819557 | 2.955289 | 0.857904 | 4.098385 | 0.018659 | 4.137520 |
| 3 | AS | 1329.803607 | 1347.052987 | 1454.143346 | 1500.054853 | 333.803436 | 137.411300 | 141.461048 | 118.526703 | 7.710047 | 9.336149 | 872.412051 | 5.894671 | 13.148854 | 4.076260 | 0.181021 | 1.947390 | 0.073984 | 4.895640 |
| 6 | DH | 1312.334752 | 1348.440907 | 1422.156464 | 1482.741270 | 5174.742547 | 83.413520 | 89.110174 | 61.533938 | 6.555868 | 9.291647 | 374.860777 | 5.560162 | 15.918400 | 1.969034 | 0.507145 | 3.062528 | 0.010457 | 4.583911 |
| 13 | MQ | 1272.131389 | 1310.782472 | 1394.785031 | 1454.427001 | 3945.017460 | 83.449369 | 86.923612 | 61.516658 | 9.002291 | 9.141727 | 388.406354 | 6.797384 | 15.380555 | 2.948788 | 0.677848 | 3.531568 | 0.005798 | 5.048129 |
| 0 | 9E | 1302.432670 | 1337.924132 | 1441.759749 | 1500.366445 | 4777.206214 | 94.616258 | 99.131211 | 68.692818 | 7.799498 | 8.973308 | 450.919173 | 8.181124 | 17.808210 | 4.193496 | 0.625921 | 2.932121 | 0.006997 | 4.332753 |
| 15 | OH | 1314.402308 | 1351.847729 | 1422.170069 | 1478.983722 | 5380.295663 | 94.462299 | 99.926300 | 68.545838 | 6.781390 | 8.765387 | 468.282349 | 8.757209 | 17.217001 | 4.494456 | 2.774956 | 3.368056 | 0.019920 | 0.453926 |
| 5 | CO | 1299.426154 | 1302.378622 | 1501.543085 | 1536.181137 | 935.221815 | 177.481846 | 179.296196 | 148.626901 | 8.279627 | 8.363712 | 1120.505658 | 7.404686 | 21.517506 | 2.214105 | 0.554635 | 5.811676 | 0.048666 | 3.247394 |
| 21 | XE | 1289.422289 | 1311.678555 | 1446.090663 | 1486.807348 | 2592.040618 | 105.151526 | 106.971454 | 80.178296 | 8.707737 | 7.979936 | 530.968393 | 6.809519 | 18.239650 | 2.187369 | 0.614977 | 5.160099 | 0.033703 | 4.337780 |
| 19 | US | 1332.888933 | 1354.739746 | 1468.716412 | 1504.378679 | 899.521459 | 128.005702 | 131.364077 | 105.443672 | 6.447901 | 7.804594 | 753.311116 | 5.855164 | 16.744077 | 2.600980 | 0.301620 | 3.416136 | 0.014288 | 3.757863 |
| 16 | OO | 1316.602710 | 1338.640012 | 1436.551059 | 1472.670782 | 5558.712228 | 80.103986 | 83.180511 | 78.878022 | 5.869575 | 7.126035 | 382.502834 | 5.448367 | 13.155473 | 3.685401 | 0.600583 | 1.866365 | 0.030930 | 2.654522 |
| 7 | DL | 1322.863881 | 1341.795584 | 1485.072820 | 1525.299351 | 1157.616510 | 142.642787 | 145.379846 | 116.582408 | 6.725651 | 7.082533 | 866.877842 | 7.601725 | 18.500608 | 2.579539 | 0.269835 | 3.753073 | 0.012010 | 2.823788 |
| 12 | HP | 1354.070217 | 1390.611420 | 1460.120216 | 1492.030525 | 457.459010 | 156.482736 | 161.141679 | 133.871594 | 4.365381 | 6.739646 | 1017.004927 | 6.822894 | 15.811886 | 2.897120 | 0.121672 | 2.269945 | 0.041605 | 1.994382 |
| 9 | F9 | 1341.851354 | 1354.190341 | 1510.894186 | 1557.255913 | 524.382400 | 142.638920 | 144.108327 | 121.431621 | 5.626204 | 6.141948 | 895.842717 | 7.415289 | 13.809887 | 2.427082 | 0.330814 | 3.409315 | 0.014063 | 2.054548 |
| 17 | TZ | 1317.840802 | 1335.844041 | 1461.852663 | 1508.548983 | 1555.351111 | 174.254756 | 175.493141 | 150.023900 | 6.007212 | 5.440283 | 1148.828393 | 7.008421 | 17.235808 | 1.913165 | 0.111014 | 4.182763 | 0.061946 | 3.483947 |
| 14 | NW | 1313.341471 | 1326.420474 | 1482.659655 | 1512.037387 | 958.062323 | 130.071362 | 130.414943 | 105.246898 | 6.742764 | 5.359419 | 755.494846 | 7.448654 | 17.417898 | 3.590880 | 0.718021 | 3.154532 | 0.019057 | 1.927582 |
| 2 | AQ | 1256.337097 | 1281.352643 | 1363.986319 | 1386.780637 | 212.422477 | 72.448163 | 74.632213 | 60.872863 | -1.217197 | 0.049687 | 424.244782 | 3.655532 | 7.922829 | 1.604606 | 0.025653 | 0.177941 | 0.020895 | 1.087748 |
| 11 | HA | 1271.034110 | 1279.720768 | 1385.996087 | 1401.376722 | 219.622229 | 92.875268 | 93.668816 | 78.318391 | -1.479387 | -0.964815 | 601.508990 | 5.326213 | 9.239592 | 1.871333 | 0.031666 | 0.052147 | 0.018656 | 0.728001 |
df_carrier.shape
(23, 19)
carrier_df = pd.read_csv('carriers.csv')
carrier_df.head()
| UniqueCarrier | Description | |
|---|---|---|
| 0 | 02Q | Titan Airways |
| 1 | 04Q | Tradewind Aviation |
| 2 | 05Q | Comlux Aviation, AG |
| 3 | 06Q | Master Top Linhas Aereas Ltd. |
| 4 | 07Q | Flair Airlines Ltd. |
carrier_df.shape
(1491, 2)
carrier_df.rename(columns = {'Code':'UniqueCarrier'}, inplace = True)
carrier_df.head()
| UniqueCarrier | Description | |
|---|---|---|
| 0 | 02Q | Titan Airways |
| 1 | 04Q | Tradewind Aviation |
| 2 | 05Q | Comlux Aviation, AG |
| 3 | 06Q | Master Top Linhas Aereas Ltd. |
| 4 | 07Q | Flair Airlines Ltd. |
carrier_airlines = df_carrier.merge(carrier_df, on = 'UniqueCarrier')
carrier_airlines.head()
| UniqueCarrier | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EV | 1319.912312 | 1342.454732 | 1458.766423 | 1481.179043 | 4478.586070 | 90.212502 | 96.144802 | 51.184388 | 10.472960 | 13.323556 | 452.395766 | 24.177069 | 14.880416 | 5.622441 | 2.296742 | 3.816344 | 0.019997 | 1.739058 | Atlantic Southeast Airlines |
| 1 | YV | 1317.866518 | 1363.032082 | 1403.371612 | 1469.609101 | 5282.880517 | 84.916192 | 90.203063 | 62.567974 | 11.063257 | 12.807473 | 399.345941 | 6.721133 | 15.664992 | 6.218178 | 0.556563 | 2.359594 | 0.029354 | 6.047495 | Mesa Airlines Inc. |
| 2 | B6 | 1346.107355 | 1370.575493 | 1419.045681 | 1485.415330 | 386.829088 | 185.955246 | 190.677219 | 159.660069 | 8.974078 | 11.097054 | 1214.614417 | 6.023280 | 20.396082 | 2.407927 | 0.391695 | 5.375139 | 0.056120 | 5.757172 | JetBlue Airways |
| 3 | FL | 1347.741878 | 1357.594007 | 1464.919415 | 1494.888471 | 493.315849 | 117.003662 | 119.253137 | 94.038883 | 9.549501 | 10.350610 | 656.409345 | 7.464555 | 15.545443 | 1.793082 | 0.130554 | 4.369197 | 0.000000 | 6.525950 | AirTran Airways Corporation |
| 4 | WN | 1353.465304 | 1352.858075 | 1483.975760 | 1508.005902 | 1316.310875 | 100.289288 | 105.952863 | 86.046146 | 5.021043 | 9.690001 | 601.619347 | 4.129367 | 10.112289 | 1.577399 | 0.321423 | 1.185418 | 0.028438 | 4.695665 | Southwest Airlines Co. |
carrier_airlines.shape
(23, 20)
carrier_airlines['Description'].value_counts()
Atlantic Southeast Airlines 1 Continental Air Lines Inc. 1 Aloha Airlines Inc. 1 Northwest Airlines Inc. 1 ATA Airlines d/b/a ATA 1 Frontier Airlines Inc. 1 America West Airlines Inc. (Merged with US Airways 9/05. Stopped reporting 10/07.) 1 Delta Air Lines Inc. 1 Skywest Airlines Inc. 1 US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.) 1 Expressjet Airlines Inc. 1 Comair Inc. 1 Mesa Airlines Inc. 1 Pinnacle Airlines Inc. 1 American Eagle Airlines Inc. 1 Independence Air 1 Alaska Airlines Inc. 1 American Airlines Inc. 1 United Air Lines Inc. 1 Southwest Airlines Co. 1 AirTran Airways Corporation 1 JetBlue Airways 1 Hawaiian Airlines Inc. 1 Name: Description, dtype: int64
carrier_airlines['Description'] = carrier_airlines['Description'].replace('America West Airlines Inc. (Merged with US Airways 9/05. Stopped reporting 10/07.)', 'America West Airlines Inc.')
carrier_airlines['Description'] = carrier_airlines['Description'].replace('US Airways Inc. (Merged with America West 9/05. Reporting for both starting 10/07.)', 'US Airways Inc.')
carrier_airlines['Description'].value_counts()
Atlantic Southeast Airlines 1 Continental Air Lines Inc. 1 Aloha Airlines Inc. 1 Northwest Airlines Inc. 1 ATA Airlines d/b/a ATA 1 Frontier Airlines Inc. 1 America West Airlines Inc. 1 Delta Air Lines Inc. 1 Skywest Airlines Inc. 1 US Airways Inc. 1 Expressjet Airlines Inc. 1 Comair Inc. 1 Mesa Airlines Inc. 1 Pinnacle Airlines Inc. 1 American Eagle Airlines Inc. 1 Independence Air 1 Alaska Airlines Inc. 1 American Airlines Inc. 1 United Air Lines Inc. 1 Southwest Airlines Co. 1 AirTran Airways Corporation 1 JetBlue Airways 1 Hawaiian Airlines Inc. 1 Name: Description, dtype: int64
w = df_carrier['UniqueCarrier'].value_counts().index[:23]
w
Index(['9E', 'HP', 'XE', 'WN', 'US', 'UA', 'TZ', 'OO', 'OH', 'NW', 'MQ', 'HA',
'AA', 'FL', 'F9', 'EV', 'DL', 'DH', 'CO', 'B6', 'AS', 'AQ', 'YV'],
dtype='object')
rcParams['figure.figsize'] = 20,14
fig, ax = plt.subplots()
#w = df_carrier['UniqueCarrier'].value_counts().index[:23]
sns.barplot(data=carrier_airlines, y='Description', x='DepDelay', color = sns.color_palette()[0]);
#sns.barplot(data=df_carrier, x='UniqueCarrier', y='ArrDelay', color='#003f5c');
plt.title('Average Departure Flights Delay By Airlines', fontsize=25, y=1.015)
plt.xlabel('Departure Delays \n (mins)', labelpad=10, fontsize=16)
plt.ylabel('Airlines', labelpad=10, fontsize=16);
#ax.set_ylim([0, 15])
#DepDelay = mpatches.Patch(color='#bc5090', label='Departure Delay')
#ArrDelay = mpatches.Patch(color='#003f5c', label='Arrival Delay')
#plt.legend(handles=[DepDelay, ArrDelay]);
rcParams['figure.figsize'] = 20,11
fig, ax = plt.subplots()
sns.pointplot(data=carrier_airlines, y='Description', x='DepDelay', color='#bc5090');
#sns.pointplot(data=carrier_airlines, y='Description', x='ArrDelay', color='#003f5c');
plt.title('Average Departure Flights Delay By Airlines', fontsize=25, y=1.015)
plt.xlabel('Departure & Arrival Delays \n (mins)', labelpad=10, fontsize=16)
plt.ylabel('Airlines', labelpad=10, fontsize=16)
#ax.set_ylim([0, 15])
DepDelay = mpatches.Patch(color='#bc5090', label='Departure Delay')
ArrDelay = mpatches.Patch(color='#003f5c', label='Arrival Delay')
#plt.legend(handles=[DepDelay, ArrDelay]);
The investigations above show that Atlantic Southwest, Mesa and Jetblue make up the 3 airlines with the most average delayed flights while Hawaiian, Aloha and Northwest represent the 3 airlines with the least or no flight delays.
df_Origin = flight_df.groupby('Origin').mean().reset_index().sort_values(['DepDelay'], ascending = False)
df_Origin.head()
| Origin | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 113 | FMN | 1334.000000 | 997.000000 | 1411.000000 | 1113.666667 | 3874.000000 | 37.000000 | 76.666667 | 24.000000 | 164.000000 | 203.666667 | 148.000000 | 4.666667 | 8.333333 | 0.000000 | 64.333333 | 0.0 | 0.0 | 0.0 |
| 78 | CYS | 1600.000000 | 1375.000000 | 1673.500000 | 1471.500000 | 6598.000000 | 33.500000 | 56.500000 | 20.500000 | 122.000000 | 145.000000 | 90.000000 | 5.000000 | 8.000000 | 0.000000 | 122.000000 | 0.0 | 0.0 | 0.0 |
| 221 | OGD | 1334.000000 | 1097.000000 | 1279.500000 | 1243.666667 | 3786.166667 | 54.666667 | 93.333333 | 43.833333 | 123.666667 | 143.666667 | 28.000000 | 4.666667 | 9.000000 | 45.000000 | 78.666667 | 0.0 | 0.0 | 0.0 |
| 27 | BFF | 1801.000000 | 1550.000000 | 1842.000000 | 1709.000000 | 6301.000000 | 41.000000 | 79.000000 | 194.000000 | 93.000000 | 131.000000 | 152.000000 | 2.000000 | 4.000000 | 0.000000 | 93.000000 | 0.0 | 0.0 | 0.0 |
| 249 | PUB | 1070.666667 | 1424.666667 | 1087.666667 | 1534.666667 | 4795.333333 | 17.000000 | 70.000000 | 8.000000 | 53.000000 | 69.333333 | 60.666667 | 6.000000 | 3.000000 | 33.666667 | 19.333333 | 0.0 | 0.0 | 0.0 |
airport_df = pd.read_csv('airports.csv')
airport_df.head()
| iata | airport | city | state | country | lat | long | |
|---|---|---|---|---|---|---|---|
| 0 | 00M | Thigpen | Bay Springs | MS | USA | 31.953765 | -89.234505 |
| 1 | 00R | Livingston Municipal | Livingston | TX | USA | 30.685861 | -95.017928 |
| 2 | 00V | Meadow Lake | Colorado Springs | CO | USA | 38.945749 | -104.569893 |
| 3 | 01G | Perry-Warsaw | Perry | NY | USA | 42.741347 | -78.052081 |
| 4 | 01J | Hilliard Airpark | Hilliard | FL | USA | 30.688012 | -81.905944 |
airport_df.rename(columns = {'iata':'Origin'}, inplace = True)
airport_name_df = df_Origin.merge(airport_df, on = 'Origin')
airport_name_df['airport_city'] = airport_name_df.airport + ',' + ' ' + airport_name_df.city + ',' + ' ' + airport_name_df.state
airport_name_df
| Origin | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | ... | NASDelay | SecurityDelay | LateAircraftDelay | airport | city | state | country | lat | long | airport_city | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FMN | 1334.000000 | 997.000000 | 1411.000000 | 1113.666667 | 3874.000000 | 37.000000 | 76.666667 | 24.000000 | 164.000000 | ... | 0.000000 | 0.000000 | 0.000000 | Four Corners Regional | Farmington | NM | USA | 36.741250 | -108.229944 | Four Corners Regional, Farmington, NM |
| 1 | CYS | 1600.000000 | 1375.000000 | 1673.500000 | 1471.500000 | 6598.000000 | 33.500000 | 56.500000 | 20.500000 | 122.000000 | ... | 0.000000 | 0.000000 | 0.000000 | Cheyenne | Cheyenne | WY | USA | 41.155723 | -104.811838 | Cheyenne, Cheyenne, WY |
| 2 | OGD | 1334.000000 | 1097.000000 | 1279.500000 | 1243.666667 | 3786.166667 | 54.666667 | 93.333333 | 43.833333 | 123.666667 | ... | 0.000000 | 0.000000 | 0.000000 | Ogden-Hinckley | Ogden | UT | USA | 41.195944 | -112.012175 | Ogden-Hinckley, Ogden, UT |
| 3 | BFF | 1801.000000 | 1550.000000 | 1842.000000 | 1709.000000 | 6301.000000 | 41.000000 | 79.000000 | 194.000000 | 93.000000 | ... | 0.000000 | 0.000000 | 0.000000 | Scotts Bluff County | Scottsbluff | NE | USA | 41.874028 | -103.595639 | Scotts Bluff County, Scottsbluff, NE |
| 4 | PUB | 1070.666667 | 1424.666667 | 1087.666667 | 1534.666667 | 4795.333333 | 17.000000 | 70.000000 | 8.000000 | 53.000000 | ... | 0.000000 | 0.000000 | 0.000000 | Pueblo Memorial | Pueblo | CO | USA | 38.289087 | -104.496572 | Pueblo Memorial, Pueblo, CO |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 320 | EFD | 1127.460929 | 1169.718762 | 1179.604002 | 1215.915440 | 2698.180445 | 31.516421 | 28.861835 | 14.089468 | 1.564364 | ... | 1.001510 | 0.001510 | 0.621744 | Ellington | Houston | TX | USA | 29.607333 | -95.158750 | Ellington , Houston, TX |
| 321 | MKK | 708.413194 | 778.211806 | 754.472222 | 823.336806 | 90.000000 | 19.947917 | 26.930556 | 12.881944 | -7.763889 | ... | 0.000000 | 0.000000 | 0.777778 | Molokai | Kaunakakai | HI | USA | 21.152886 | -157.096256 | Molokai, Kaunakakai, HI |
| 322 | ITO | 1335.866844 | 1346.052738 | 1420.853302 | 1427.250149 | 394.580636 | 49.243120 | 48.590286 | 37.745999 | -2.073627 | ... | 0.074818 | 0.010524 | 1.204718 | Hilo International | Hilo | HI | USA | 19.720263 | -155.048470 | Hilo International, Hilo, HI |
| 323 | LNY | 665.020761 | 705.695502 | 671.584775 | 730.941176 | 91.570934 | 14.543253 | 15.972318 | -77.861592 | -4.564014 | ... | 0.000000 | 0.000000 | 0.000000 | Lanai | Lanai City | HI | USA | 20.785611 | -156.951418 | Lanai, Lanai City, HI |
| 324 | GLH | 1144.500000 | 1150.000000 | 1239.000000 | 1235.000000 | 5803.000000 | 54.500000 | 45.000000 | 35.500000 | 4.000000 | ... | 0.000000 | 0.000000 | 0.000000 | Mid Delta Regional | Greenville | MS | USA | 33.482881 | -90.985614 | Mid Delta Regional, Greenville, MS |
325 rows × 26 columns
df_Origin_max = airport_name_df.nlargest(30, 'DepDelay')
df_Origin_max.head()
| Origin | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | ... | NASDelay | SecurityDelay | LateAircraftDelay | airport | city | state | country | lat | long | airport_city | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FMN | 1334.000000 | 997.000000 | 1411.000000 | 1113.666667 | 3874.000000 | 37.000000 | 76.666667 | 24.000000 | 164.000000 | ... | 0.0 | 0.0 | 0.0 | Four Corners Regional | Farmington | NM | USA | 36.741250 | -108.229944 | Four Corners Regional, Farmington, NM |
| 1 | CYS | 1600.000000 | 1375.000000 | 1673.500000 | 1471.500000 | 6598.000000 | 33.500000 | 56.500000 | 20.500000 | 122.000000 | ... | 0.0 | 0.0 | 0.0 | Cheyenne | Cheyenne | WY | USA | 41.155723 | -104.811838 | Cheyenne, Cheyenne, WY |
| 2 | OGD | 1334.000000 | 1097.000000 | 1279.500000 | 1243.666667 | 3786.166667 | 54.666667 | 93.333333 | 43.833333 | 123.666667 | ... | 0.0 | 0.0 | 0.0 | Ogden-Hinckley | Ogden | UT | USA | 41.195944 | -112.012175 | Ogden-Hinckley, Ogden, UT |
| 3 | BFF | 1801.000000 | 1550.000000 | 1842.000000 | 1709.000000 | 6301.000000 | 41.000000 | 79.000000 | 194.000000 | 93.000000 | ... | 0.0 | 0.0 | 0.0 | Scotts Bluff County | Scottsbluff | NE | USA | 41.874028 | -103.595639 | Scotts Bluff County, Scottsbluff, NE |
| 4 | PUB | 1070.666667 | 1424.666667 | 1087.666667 | 1534.666667 | 4795.333333 | 17.000000 | 70.000000 | 8.000000 | 53.000000 | ... | 0.0 | 0.0 | 0.0 | Pueblo Memorial | Pueblo | CO | USA | 38.289087 | -104.496572 | Pueblo Memorial, Pueblo, CO |
5 rows × 26 columns
rcParams['figure.figsize'] = 19,13
fig, ax = plt.subplots()
sns.barplot(data=df_Origin_max, y='airport_city', x='DepDelay', color=base_color);
#sns.barplot(data=df_carrier, x='UniqueCarrier', y='ArrDelay', color='#003f5c');
plt.title('Airports with the Most Average Delayed Flights', fontsize=25, y=1.015)
plt.ylabel('Airport, City and State', labelpad=10, fontsize=16)
plt.xlabel('Departure Delays \n (mins)', labelpad=10, fontsize=16);
#ax.set_ylim([0, 15])
#DepDelay = mpatches.Patch(color='#bc5090', label='Departure Delay')
#ArrDelay = mpatches.Patch(color='#003f5c', label='Arrival Delay')
#plt.legend(handles=[DepDelay, ArrDelay]);
From the investigation above, four airports (Scotts Bluff, Ogden-Hinckley, Cheyenne and Four Corners Regional) were observed to have average flight delay time to be greater than 2hours
df_Origin_min = airport_name_df.nsmallest(30, 'DepDelay')
df_Origin_min.head()
| Origin | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | ... | NASDelay | SecurityDelay | LateAircraftDelay | airport | city | state | country | lat | long | airport_city | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 324 | GLH | 1144.500000 | 1150.000000 | 1239.000000 | 1235.000000 | 5803.000000 | 54.500000 | 45.000000 | 35.500000 | 4.000000 | ... | 0.000000 | 0.000000 | 0.000000 | Mid Delta Regional | Greenville | MS | USA | 33.482881 | -90.985614 | Mid Delta Regional, Greenville, MS |
| 323 | LNY | 665.020761 | 705.695502 | 671.584775 | 730.941176 | 91.570934 | 14.543253 | 15.972318 | -77.861592 | -4.564014 | ... | 0.000000 | 0.000000 | 0.000000 | Lanai | Lanai City | HI | USA | 20.785611 | -156.951418 | Lanai, Lanai City, HI |
| 322 | ITO | 1335.866844 | 1346.052738 | 1420.853302 | 1427.250149 | 394.580636 | 49.243120 | 48.590286 | 37.745999 | -2.073627 | ... | 0.074818 | 0.010524 | 1.204718 | Hilo International | Hilo | HI | USA | 19.720263 | -155.048470 | Hilo International, Hilo, HI |
| 321 | MKK | 708.413194 | 778.211806 | 754.472222 | 823.336806 | 90.000000 | 19.947917 | 26.930556 | 12.881944 | -7.763889 | ... | 0.000000 | 0.000000 | 0.777778 | Molokai | Kaunakakai | HI | USA | 21.152886 | -157.096256 | Molokai, Kaunakakai, HI |
| 320 | EFD | 1127.460929 | 1169.718762 | 1179.604002 | 1215.915440 | 2698.180445 | 31.516421 | 28.861835 | 14.089468 | 1.564364 | ... | 1.001510 | 0.001510 | 0.621744 | Ellington | Houston | TX | USA | 29.607333 | -95.158750 | Ellington , Houston, TX |
5 rows × 26 columns
rcParams['figure.figsize'] = 19,13
fig, ax = plt.subplots()
sns.barplot(data=df_Origin_min, y='airport_city', x='DepDelay', color=base_color);
#sns.barplot(data=df_carrier, x='UniqueCarrier', y='ArrDelay', color='#003f5c');
plt.title('Airports with the Least Average Delayed Flights', fontsize=25, y=1.015)
plt.ylabel('Airport, City and State', labelpad=10, fontsize=16)
plt.xlabel('Departure Delays \n (mins)', labelpad=10, fontsize=16);
#ax.set_ylim([0, 15])
#DepDelay = mpatches.Patch(color='#bc5090', label='Departure Delay')
#ArrDelay = mpatches.Patch(color='#003f5c', label='Arrival Delay')
#plt.legend(handles=[DepDelay, ArrDelay]);
The chart above gives an indication of the airports with the least flight delays and it shows that 9 airports all had average departure delays less than zero which is tells a possible story of how efficient services are in those airports